Optimization Theory for ReLU Neural Networks Trained with Normalization Layers
The success of deep neural networks is in part due to the use of
normalization layers. Normalization layers like Batch Normalization, Layer
Normalization and Weight Normalization are ubiquitous in practice, as they
improve generalization performance and speed up training significantly.
Nonetheless, the vast majority of current deep learning theory and non-convex
optimization literature focuses on the un-normalized setting, where the
functions under consideration do not exhibit the properties of commonly
normalized neural networks. In this paper, we bridge this gap by giving the
first global convergence result for two-layer neural networks with ReLU
activations trained with a normalization layer, namely Weight Normalization.
Our analysis shows how the introduction of normalization layers changes the
optimization landscape and can enable faster convergence as compared with
un-normalized neural networks. Comment: To be presented at ICML 2020.
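For concreteness, Weight Normalization reparameterizes each weight vector by decoupling its magnitude from its direction; a minimal sketch of the resulting two-layer ReLU network follows (the notation is ours, not necessarily the paper's):

```latex
% Two-layer ReLU network under Weight Normalization (illustrative notation)
f(x) \;=\; \sum_{k=1}^{m} c_k \,\sigma\!\left( g_k \,\frac{v_k^\top x}{\lVert v_k \rVert} \right),
\qquad \sigma(z) \;=\; \max(z, 0)
```

Each effective weight $w_k = g_k v_k / \lVert v_k \rVert$ thus has its magnitude $g_k$ and direction $v_k / \lVert v_k \rVert$ trained as separate parameters, which is what changes the optimization landscape relative to the un-normalized parameterization.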
Learning Expressive Prompting With Residuals for Vision Transformers
Prompt learning is an efficient approach to adapting transformers by inserting a learnable set of parameters into the input and intermediate representations of
a pre-trained model. In this work, we present Expressive Prompts with Residuals
(EXPRES), which modifies the prompt learning paradigm specifically for effective adaptation of vision transformers (ViT). Our method constructs downstream representations via learnable ``output'' tokens, which are akin to the learned class tokens of the ViT. Further, for better steering of the downstream
representation processed by the frozen transformer, we introduce residual
learnable tokens that are added to the output of various computations. We apply
EXPRES for image classification, few-shot learning, and semantic segmentation, and show our method is capable of achieving state-of-the-art prompt tuning on
3/3 categories of the VTAB benchmark. In addition to strong performance, we
observe that our approach is an order of magnitude more prompt efficient than
existing visual prompting baselines. We analytically show the computational
benefits of our approach over weight space adaptation techniques like
finetuning. Lastly we systematically corroborate the architectural design of
our method via a series of ablation experiments. Comment: Accepted at CVPR (2023).
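To make the mechanism concrete, here is a minimal PyTorch-style sketch of prompt tuning with learnable output tokens and per-block residual tokens; module names and shapes are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn as nn

class ResidualPromptedViT(nn.Module):
    """Illustrative sketch: learnable ``output'' tokens prepended to the sequence and
    learnable residual tokens added after each frozen ViT block (names/shapes assumed)."""

    def __init__(self, frozen_blocks: nn.ModuleList, dim: int, num_output_tokens: int = 4):
        super().__init__()
        self.blocks = frozen_blocks
        for p in self.blocks.parameters():               # backbone stays frozen
            p.requires_grad = False
        # learnable "output" tokens, analogous to the ViT class token
        self.output_tokens = nn.Parameter(torch.zeros(1, num_output_tokens, dim))
        # one learnable residual token per block, added to that block's output
        self.residual_tokens = nn.ParameterList(
            [nn.Parameter(torch.zeros(1, 1, dim)) for _ in range(len(self.blocks))]
        )

    def forward(self, patch_embeddings):                 # (batch, patches, dim)
        b = patch_embeddings.shape[0]
        x = torch.cat([self.output_tokens.expand(b, -1, -1), patch_embeddings], dim=1)
        for block, res in zip(self.blocks, self.residual_tokens):
            x = block(x) + res                           # residual token steers the frozen computation
        return x[:, : self.output_tokens.shape[1]]       # downstream representation from output tokens
```

Only the output and residual tokens receive gradients, which is what keeps the per-task parameter count an order of magnitude below standard visual prompting.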
A theory for undercompressive shocks in tears of wine
We revisit the tears of wine problem for thin films in water-ethanol mixtures
and present a new model for the climbing dynamics. The new formulation includes
a Marangoni stress balanced by both the normal and tangential components of
gravity as well as surface tension, which leads to distinctly different behavior.
The prior literature did not address the wine tears but rather the behavior of
the film at earlier stages and the behavior of the meniscus. In the lubrication
limit we obtain an equation that is already well-known for rising films in the
presence of thermal gradients. Such models can exhibit non-classical shocks
that are undercompressive. We present basic theory that allows one to identify
the signature of an undercompressive (UC) wave. We observe both compressive and
undercompressive waves in new experiments and we argue that, in the case of a
pre-coated glass, the famous "wine tears" emerge from a reverse
undercompressive shock originating at the meniscus.
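For context, the lubrication-limit equation referred to above is of the type long studied for thermally driven climbing films; a standard non-dimensional form is sketched below (the coefficients and scalings are illustrative and may differ from the paper's):

```latex
% Film-height evolution with a non-convex flux and fourth-order surface-tension regularization
h_t + \left( h^2 - h^3 \right)_x \;=\; -\left( h^3\, h_{xxx} \right)_x
```

The non-convex flux $h^2 - h^3$ (Marangoni driving competing with gravitational drainage), regularized by the fourth-order surface-tension term, is what admits undercompressive shocks: fronts that violate the classical Lax entropy condition, with characteristics passing through the wave rather than converging on it.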
Wasserstein Diffusion Tikhonov Regularization
We propose regularization strategies for learning discriminative models that
are robust to in-class variations of the input data. We use the Wasserstein-2
geometry to capture semantically meaningful neighborhoods in the space of
images, and define a corresponding input-dependent additive noise data
augmentation model. Expanding and integrating the augmented loss yields an
effective Tikhonov-type Wasserstein diffusion smoothness regularizer. This
approach allows us to apply high levels of regularization and train functions
that have low variability within classes but remain flexible across classes. We
provide efficient methods for computing the regularizer at a negligible cost in
comparison to training with adversarial data augmentation. Initial experiments
demonstrate improvements in generalization performance under adversarial perturbations as well as under large in-class variations of the input data.
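As a rough illustration of a Tikhonov-type smoothness penalty of the kind described, the sketch below penalizes the input gradient of the loss under an input-dependent weighting; the `noise_scale` callable is a generic stand-in for the Wasserstein-2-derived weighting in the paper, not its actual form.

```python
import torch

def tikhonov_regularized_loss(model, loss_fn, x, y, noise_scale, lam=1.0):
    """Illustrative Tikhonov-type regularizer: squared norm of the input gradient
    of the loss, weighted by an input-dependent scale (`noise_scale` is an
    assumption standing in for the paper's Wasserstein-2 weighting)."""
    x = x.detach().clone().requires_grad_(True)
    loss = loss_fn(model(x), y)
    # gradient of the loss with respect to the inputs, kept in the graph
    grad_x, = torch.autograd.grad(loss, x, create_graph=True)
    weighted = noise_scale(x) * grad_x
    penalty = weighted.pow(2).flatten(start_dim=1).sum(dim=1).mean()
    return loss + lam * penalty
```

Because the penalty is computed from a single extra gradient rather than from adversarially optimized samples, its cost stays small compared with adversarial data augmentation.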
SAFE: Machine Unlearning With Shard Graphs
We present Synergy Aware Forgetting Ensemble (SAFE), a method to adapt large
models on a diverse collection of data while minimizing the expected cost to
remove the influence of training samples from the trained model. This process,
also known as selective forgetting or unlearning, is often conducted by
partitioning a dataset into shards, training fully independent models on each,
then ensembling the resulting models. Increasing the number of shards reduces the expected cost to forget, but at the same time increases inference cost
and reduces the final accuracy of the model since synergistic information
between samples is lost during the independent model training. Rather than
treating each shard as independent, SAFE introduces the notion of a shard
graph, which allows incorporating limited information from other shards during
training, trading off a modest increase in expected forgetting cost with a
significant increase in accuracy, all while still attaining complete removal of
residual influence after forgetting. SAFE uses a lightweight system of adapters
which can be trained while reusing most of the computations. This allows SAFE
to be trained on shards an order-of-magnitude smaller than current
state-of-the-art methods (thus reducing the forgetting costs) while also
maintaining high accuracy, as we demonstrate empirically on fine-grained
computer vision datasets.
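A schematic of the shard-graph idea in Python follows; the shard containers and the `train_adapter` routine are placeholders for this sketch, not the authors' implementation.

```python
# Schematic of shard-graph training and forgetting in the spirit of SAFE.

def train_all(shards, shard_graph, train_adapter):
    """shard_graph[i] lists the other shards whose data adapter i is allowed to see."""
    adapters = {}
    for i, shard in enumerate(shards):
        neighbors = [shards[j] for j in shard_graph.get(i, [])]
        adapters[i] = train_adapter(shard, neighbors)   # adapter i depends only on shard i and its in-edges
    return adapters

def forget(sample_shard, shards, shard_graph, adapters, train_adapter):
    """After deleting the sample from shards[sample_shard], retrain exactly the adapters
    whose training data included that shard, removing all residual influence."""
    affected = [i for i in range(len(shards))
                if i == sample_shard or sample_shard in shard_graph.get(i, [])]
    for i in affected:
        neighbors = [shards[j] for j in shard_graph.get(i, [])]
        adapters[i] = train_adapter(shards[i], neighbors)
    return adapters
```

A sparser graph means fewer adapters to retrain per deletion (lower forgetting cost); denser edges recover more cross-shard synergy (higher accuracy), which is the trade-off the abstract describes.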
Your representations are in the network: composable and parallel adaptation for large scale models
We propose InCA, a lightweight method for transfer learning that
cross-attends to any activation layer of a pre-trained model. During training,
InCA uses a single forward pass to extract multiple activations, which are
passed to external cross-attention adapters, trained anew and combined or
selected for downstream tasks. We show that, even when selecting a single
top-scoring adapter, InCA achieves performance comparable to full fine-tuning,
at a cost comparable to fine-tuning just the last layer. For example, with a
cross-attention probe 1.3% the size of a pre-trained ViT-L/16 model, we achieve
performance within 0.2% of the full fine-tuning paragon at a computational
training cost of 51% of the baseline, on average across 11 downstream classification tasks. Unlike other forms of efficient adaptation, InCA does not
require backpropagating through the pre-trained model, thus leaving its
execution unaltered at both training and inference. The versatility of InCA is
best illustrated in fine-grained tasks, which may require accessing information
absent in the last layer but accessible in intermediate layer activations.
Since the backbone is fixed, InCA allows parallel ensembling as well as
parallel execution of multiple tasks. InCA achieves state-of-the-art
performance in the ImageNet-to-Sketch multi-task benchmark. Comment: Accepted to NeurIPS 2023.
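Below is a minimal sketch of an external cross-attention adapter over frozen intermediate activations, in the spirit of the description above; the module names, sizes, and pooling scheme are assumptions for illustration.

```python
import torch
import torch.nn as nn

class CrossAttentionAdapter(nn.Module):
    """Illustrative external adapter: a learned query cross-attends to frozen
    intermediate activations of a backbone (names and sizes are assumptions)."""

    def __init__(self, act_dim, num_classes, num_heads=8):
        super().__init__()
        self.query = nn.Parameter(torch.zeros(1, 1, act_dim))
        self.attn = nn.MultiheadAttention(act_dim, num_heads, batch_first=True)
        self.head = nn.Linear(act_dim, num_classes)

    def forward(self, activations):                       # (batch, tokens, act_dim) from one layer
        q = self.query.expand(activations.shape[0], -1, -1)
        pooled, _ = self.attn(q, activations, activations) # cross-attend: learned query vs. activation tokens
        return self.head(pooled.squeeze(1))

if __name__ == "__main__":
    # toy demo with random tensors standing in for one layer's frozen activations
    acts = torch.randn(4, 197, 1024)                       # ViT-L/16-like token grid (assumed shape)
    adapter = CrossAttentionAdapter(act_dim=1024, num_classes=10)
    print(adapter(acts).shape)                             # torch.Size([4, 10])
```

Since gradients flow only through the adapter and never through the backbone, one frozen forward pass can feed many such adapters (one per layer or per task) trained and executed in parallel.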
Part I: The geometry and manipulation of natural data for optimizing neural networks Part II: A theory for undercompressive shocks in tears of wine
In Part I of the thesis, we present a body of work analyzing and deriving data-centric regularization methods for the effective training of machine learning models. Machine learning, and deep learning in particular, has been highly successful in computer vision and generative modelling in recent years. Nonetheless, the progress of such approaches crucially relies on effective regularization, architectural, and algorithmic choices that are often abstracted away during a first consideration. In this part we present the reader with effective regularization approaches focused on the geometry and biases of natural data and the parameterization of deep neural networks. We start by deriving a regularization that accurately captures geometric robustness and natural variances of images in Chapter 1. This approach enables significant improvements in model robustness and relies on the theory of optimal transport, which we introduce alongside our method in the chapter. In Chapter 2, dataset regularization is extended to active manipulation of the sampling distribution, as opposed to each datum; there we present a general and differentiable technique for dataset optimization that enables de-biasing of noisy and imbalanced datasets. In our final contribution to Part I, Chapter 3, we study the interplay between data and model parameterization. This concerns the widespread architectural practice of neural network normalization. We analyze the convergence dynamics of Weight Normalization and present the first proof of global convergence for dynamically normalized ReLU networks trained with gradient descent.

In Part II, we study the fluid dynamics phenomenon known as the tears of wine problem for thin films in water-ethanol mixtures and present a model for the climbing dynamics. The new formulation includes a Marangoni stress balanced by both the normal and tangential components of gravity as well as surface tension, which leads to distinctly different behavior. The prior literature did not address the wine tears but rather the behavior of the film at earlier stages and the behavior of the meniscus. In the lubrication limit we obtain an equation that is already well known for rising films in the presence of thermal gradients. Such models can exhibit nonclassical shocks that are undercompressive. We present basic theory that allows one to identify the signature of an undercompressive wave. We observe both compressive and undercompressive waves in new experiments and argue that, in the case of a pre-swirled glass, the famous "wine tears" emerge from a reverse undercompressive shock originating at the meniscus.